Today we’ll be working with a dataset that I downloaded from the United States Drought Monitor.
First, let’s load the tidyverse
package. The tidyverse
package is a wrapper that loads a number of different useful packages with just one command. In particular, it loads the ggplot2
, dplyr
and tidyr
packages which we will be using. (Install the package with install.packages("tidyverse")
if you have not done so yet.)
library(tidyverse)
Next, run the following line of code to get the dataset as a data frame in R. read.csv()
is a function in base R that allows us to read in CSV files from the internet (or from your local disk). We specify the option stringsAsFactors = FALSE
because we want character variables to remain as characters.
df <- read.csv("http://web.stanford.edu/~kjytay/courses/stats32-aut2019/Session%206/ca_drought_data_20150920_20190920.csv",
stringsAsFactors = FALSE)
As usual, the str()
function gives us a good overview of our dataset:
str(df)
## 'data.frame': 12180 obs. of 13 variables:
## $ MapDate : int 20190917 20190910 20190903 20190827 20190820 20190813 20190806 20190730 20190723 20190716 ...
## $ FIPS : int 6001 6001 6001 6001 6001 6001 6001 6001 6001 6001 ...
## $ County : chr "Alameda County" "Alameda County" "Alameda County" "Alameda County" ...
## $ State : chr "CA" "CA" "CA" "CA" ...
## $ None : num 100 100 100 100 100 100 100 100 100 100 ...
## $ D0 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ D1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ D2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ D3 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ D4 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ValidStart : chr "2019-09-17" "2019-09-10" "2019-09-03" "2019-08-27" ...
## $ ValidEnd : chr "2019-09-23" "2019-09-16" "2019-09-09" "2019-09-02" ...
## $ StatisticFormatID: int 2 2 2 2 2 2 2 2 2 2 ...
Before we start using the dataset for our analysis, we should do some sanity checks to make sure that the data we have is indeed the data we expect. For example: California has 58 counties, so we should have all 58 counties represented. We check this using summarize
and the function n_distinct
.
# Sanity check: should have all 58 counties
df %>% summarize(distinct = n_distinct(County))
## distinct
## 1 58
Let’s do another sanity check: Since we have 3 years of data, we should expect around \(52 \times 4 = 208\) entries for each county. We check this by counting the number of entries, then checking that 156
is the only value we receive among all the counts:
# Sanity check: each county should have ~156 observations
df %>% group_by(County) %>%
summarize(count = n()) %>%
distinct(count)
## # A tibble: 1 x 1
## count
## <int>
## 1 210
It’s not 208, but it’s close enough. It’s also a good sign that all the counties have the same number of observaations.
Some thoughts about the fields in our dataset:
MapDate
seems to be the same as ValidStart
, and ValidEnd
is always 6 days later than ValidStart
. We can keep only the ValidStart
column and treat that as the date which the data in that row is referring to.ValidStart
is a character variable, even though we want to treat it as a date. We can make this change by using as.Date()
.FIPS
and StatisticFormatID
(we don’t know what they mean anyway), and State
is the same for all values in our dataset, so we can safely drop these columns.None
to D4
always add up to 100%. For us, the percent area in drought is more useful, so we will make a new Drought
column for that.Let’s keep just the rows we want using select
in dplyr
. Type in the code below (notice how the code below changes the name for column ValidStart
at the same time):
# select important rows, change Date to date variable
df_levels <- df %>% select(County, Date = ValidStart, D0:D4) %>%
mutate(Date = as.Date(Date),
Drought = D0 + D1 + D2 + D3 + D4)
gather()
The chart we want to make display data for just Santa Clara county, so we should use dplyr
’s filter
to get a new dataset with just the observations from Santa Clara:
# filter for just Santa Clara County
df_county <- df_levels %>% filter(County == "Santa Clara County")
Let’s make a line plot of D0
against Date
:
ggplot(data = df_county) +
geom_line(mapping = aes(x = Date, y = D0))
To make an area plot instead, replace geom_line
with geom_area
:
ggplot(data = df_county) +
geom_area(mapping = aes(x = Date, y = D0))
How can we plot D0
and D1
against Date
? We could try this:
ggplot(data = df_county) +
geom_area(mapping = aes(x = Date, y = D0)) +
geom_area(mapping = aes(x = Date, y = D1))
There are a number of issues with this approach:
D0
to D4
, we need 3 more lines of code. What if there were 10 drought levels instead?
D0
even though the values plotted are for D0
and D1
.The reason we are facing this issue is because the values of drought level are column names. We need to reorganize the dataset such that there is a Drought level
column with values D0
to D4
, and a corresponding Percent of land area
which gives the percent of land area in that level of drought (In Hadley Wickham’s terminology, the original dataset is not “tidy” and we have to make it so.) We can use tidyr
’s gather
function to accomplish this.
# gather drought levels
df_county <- df_county %>% gather(D0:D4, key = "Drought level",
value = "Percent of land area")
Now we can easily plot D0
to D4
on the same chart (because our column names have spaces in them, we need to surround them with backticks (`) so that R interprets the code properly):
# make area plot
ggplot(data = df_county) +
geom_area(mapping = aes(x = Date, y = `Percent of land area`,
fill = `Drought level`))
Let’s use scale_fill_brewer
to make the area colors more theme-appropriate, and add a title:
# make area plot
ggplot(data = df_county) +
geom_area(mapping = aes(x = Date, y = `Percent of land area`,
fill = `Drought level`)) +
labs(title = "Drought levels in Santa Clara County") +
scale_fill_brewer(palette = "YlOrRd") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
(scale_fill_brewer
works in the same way as scale_color_brewer
, except that it refers to the “fill” aesthetic rather than the “color” aesthetic. Both of these functions work for discrete variables; for continuous variables, use scale_fill_distiller
and scale_color_distiller
instead.)
The plot we just made was for Santa Clara County. What if we were interested in another county, say Monterey County, instead? Do we have to type up all the code again? No: we could just look for all mentions of “Santa Clara County” and change them to “Monterey County”.
While this works, it is not advised: if “Santa Clara County” was used 50 times in our script, we’d have to find all 50 instances and change them. A better way to do this would be to assign the county to a variable right at the top of the script, and have all mentions of the county in the script be replaced by this variable.
The code below demonstrates this approach. (It assumes that we have already computed the data frame df_levels
.) Note that the changes are actually quite minimal. For the plot, we made use of the paste()
function to make the plot title dynamic.
county <- "Santa Barbara County"
# filter for the county, reshape data
df_county <- df_levels %>%
filter(County == county) %>%
gather(D0:D4, key = "Drought level",
value = "Percent of land area")
# make area plot
ggplot(data = df_county) +
geom_area(mapping = aes(x = Date, y = `Percent of land area`,
fill = `Drought level`)) +
labs(title = paste("Drought levels in", county)) +
scale_fill_brewer(palette = "YlOrRd") +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
separate()
Let’s go back to the dataset df_levels
. I want a visualization that tells me how much drought the counties faced over the entire period in my dataset. As a proxy for “how much drought”, I will take the mean of the Drought
column across the whole period. For the visualization to be easier on the eyes, I will use coord_flip()
to flip the x and y axes:
df_levels %>%
group_by(County) %>%
summarize(mean_drought = mean(Drought)) %>%
ggplot() +
geom_col(aes(x = County, y = mean_drought)) +
coord_flip()
Note how the counties are arranged: in alphabetical order, with the smallest at the bottom. This is the default. If we want to arrange them in some other order, we’ll need to learn more about factors (which we will next session).
All the counties have the word “County” behind them. This is not very informative and it takes up space on our plot. We can use separate()
from the tidyr
package to help us remove them.
df_levels %>%
separate(County, into = c("County", "xx"), sep = " ") %>%
head()
## Warning: Expected 2 pieces. Additional pieces discarded in 2940 rows [1261,
## 1262, 1263, 1264, 1265, 1266, 1267, 1268, 1269, 1270, 1271, 1272, 1273,
## 1274, 1275, 1276, 1277, 1278, 1279, 1280, ...].
## County xx Date D0 D1 D2 D3 D4 Drought
## 1 Alameda County 2019-09-17 0 0 0 0 0 0
## 2 Alameda County 2019-09-10 0 0 0 0 0 0
## 3 Alameda County 2019-09-03 0 0 0 0 0 0
## 4 Alameda County 2019-08-27 0 0 0 0 0 0
## 5 Alameda County 2019-08-20 0 0 0 0 0 0
## 6 Alameda County 2019-08-13 0 0 0 0 0 0
We received a warning! That’s because some county names have more than one word (excluding “County”). To remove just the word “County” at the end, instead of separating on spaces, we can tell separate()
to separate the last 7 characters out. The minus sign tells separate
that we are counting from the right, not the left:
df_levels %>%
separate(County, into = c("County", "xx"), sep = -7) %>%
head()
## County xx Date D0 D1 D2 D3 D4 Drought
## 1 Alameda County 2019-09-17 0 0 0 0 0 0
## 2 Alameda County 2019-09-10 0 0 0 0 0 0
## 3 Alameda County 2019-09-03 0 0 0 0 0 0
## 4 Alameda County 2019-08-27 0 0 0 0 0 0
## 5 Alameda County 2019-08-20 0 0 0 0 0 0
## 6 Alameda County 2019-08-13 0 0 0 0 0 0
OK that looks like it works. Let’s put the rest of our data processing pipeline behind this:
df_levels %>%
separate(County, into = c("County", "xx"), sep = -7) %>%
group_by(County) %>%
summarize(mean_drought = mean(Drought)) %>%
ggplot() +
geom_col(aes(x = County, y = mean_drought)) +
coord_flip()
We can add a vertical line and some colors to the plot to make it more informative. Note that in the code we make a horizontal line instead: that’s because coord_flip()
will make it vertical.
df_levels %>%
separate(County, into = c("County", "xx"), sep = -7) %>%
group_by(County) %>%
summarize(mean_drought = mean(Drought)) %>%
ggplot() +
geom_col(aes(x = County, y = mean_drought, fill = mean_drought)) +
geom_hline(yintercept = 50, col = "red", lty = 2, lwd = 1) +
scale_fill_distiller(palette = "YlOrRd", direction = 1) +
coord_flip() +
theme(legend.position = "bottom")
dplyr
practiceBelow are some exercise for you to practice your dplyr
skills. All of them use the df_levels
dataset as the starting point. Solutions are in the section below.
Select the rows with D4 being 100.
Which counties experienced D4 being 100 on 2015-10-20?
Which counties experienced D4 being 100 on 2015-10-20? Give just the county names.
Which counties experienced 100% land area in D4 at any time? (Hint: instead of n_distinct, try distinct.)
Select rows with date “2018-09-18” and return the entries in descending order of D0.
Which county experienced the most number of weeks with 100% land area in D4?
dplyr
practice (solutions)df_levels %>% filter(D4 == 100)
df_levels %>% filter(D4 == 100 & Date == "2015-10-20")
## County Date D0 D1 D2 D3 D4 Drought
## 1 Alpine County 2015-10-20 0 0 0 0 100 100
## 2 Amador County 2015-10-20 0 0 0 0 100 100
## 3 Calaveras County 2015-10-20 0 0 0 0 100 100
## 4 El Dorado County 2015-10-20 0 0 0 0 100 100
## 5 Fresno County 2015-10-20 0 0 0 0 100 100
## 6 Kings County 2015-10-20 0 0 0 0 100 100
## 7 Madera County 2015-10-20 0 0 0 0 100 100
## 8 Mariposa County 2015-10-20 0 0 0 0 100 100
## 9 Merced County 2015-10-20 0 0 0 0 100 100
## 10 Mono County 2015-10-20 0 0 0 0 100 100
## 11 Nevada County 2015-10-20 0 0 0 0 100 100
## 12 Placer County 2015-10-20 0 0 0 0 100 100
## 13 Plumas County 2015-10-20 0 0 0 0 100 100
## 14 San Joaquin County 2015-10-20 0 0 0 0 100 100
## 15 Santa Barbara County 2015-10-20 0 0 0 0 100 100
## 16 Sierra County 2015-10-20 0 0 0 0 100 100
## 17 Stanislaus County 2015-10-20 0 0 0 0 100 100
## 18 Sutter County 2015-10-20 0 0 0 0 100 100
## 19 Tulare County 2015-10-20 0 0 0 0 100 100
## 20 Tuolumne County 2015-10-20 0 0 0 0 100 100
## 21 Ventura County 2015-10-20 0 0 0 0 100 100
## 22 Yuba County 2015-10-20 0 0 0 0 100 100
df_levels %>% filter(D4 == 100 & Date == "2015-10-20") %>%
select(County)
## County
## 1 Alpine County
## 2 Amador County
## 3 Calaveras County
## 4 El Dorado County
## 5 Fresno County
## 6 Kings County
## 7 Madera County
## 8 Mariposa County
## 9 Merced County
## 10 Mono County
## 11 Nevada County
## 12 Placer County
## 13 Plumas County
## 14 San Joaquin County
## 15 Santa Barbara County
## 16 Sierra County
## 17 Stanislaus County
## 18 Sutter County
## 19 Tulare County
## 20 Tuolumne County
## 21 Ventura County
## 22 Yuba County
df_levels %>% filter(D4 == 100) %>%
distinct(County)
## County
## 1 Alpine County
## 2 Amador County
## 3 Calaveras County
## 4 El Dorado County
## 5 Fresno County
## 6 Kings County
## 7 Madera County
## 8 Mariposa County
## 9 Merced County
## 10 Mono County
## 11 Nevada County
## 12 Placer County
## 13 Plumas County
## 14 San Joaquin County
## 15 Santa Barbara County
## 16 Sierra County
## 17 Stanislaus County
## 18 Sutter County
## 19 Tulare County
## 20 Tuolumne County
## 21 Ventura County
## 22 Yuba County
df_levels %>% filter(Date == "2018-09-18") %>% arrange(desc(D0))
Which county experienced the most number of weeks with 100% land area in D4?
df_levels %>%
filter(D4 == 100) %>%
group_by(County) %>%
summarize(count = n()) %>%
arrange(desc(count))
## # A tibble: 22 x 2
## County count
## <chr> <int>
## 1 Santa Barbara County 69
## 2 Ventura County 69
## 3 Fresno County 31
## 4 Kings County 31
## 5 Madera County 31
## 6 Mariposa County 31
## 7 Merced County 31
## 8 San Joaquin County 31
## 9 Stanislaus County 31
## 10 Tulare County 31
## # … with 12 more rows
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3 purrr_0.3.2
## [5] readr_1.3.1 tidyr_0.8.3 tibble_2.1.3 ggplot2_3.2.1
## [9] tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 RColorBrewer_1.1-2 cellranger_1.1.0
## [4] pillar_1.4.2 compiler_3.6.1 tools_3.6.1
## [7] zeallot_0.1.0 digest_0.6.20 lubridate_1.7.4
## [10] jsonlite_1.6 evaluate_0.14 nlme_3.1-140
## [13] gtable_0.3.0 lattice_0.20-38 pkgconfig_2.0.2
## [16] rlang_0.4.0 cli_1.1.0 rstudioapi_0.10
## [19] yaml_2.2.0 haven_2.1.1 xfun_0.9
## [22] withr_2.1.2 xml2_1.2.2 httr_1.4.1
## [25] knitr_1.24 vctrs_0.2.0 generics_0.0.2
## [28] hms_0.5.1 grid_3.6.1 tidyselect_0.2.5
## [31] glue_1.3.1 R6_2.4.0 fansi_0.4.0
## [34] readxl_1.3.1 rmarkdown_1.15 modelr_0.1.5
## [37] magrittr_1.5 backports_1.1.4 scales_1.0.0
## [40] htmltools_0.3.6 rvest_0.3.4 assertthat_0.2.1
## [43] colorspace_1.4-1 labeling_0.3 utf8_1.1.4
## [46] stringi_1.4.3 lazyeval_0.2.2 munsell_0.5.0
## [49] broom_0.5.2 crayon_1.3.4